WPS data sample Quality Control (QC)

Following the RNAi identity QC, we should already corrected all wrong clones and ready to move to the second QC step, sample qualities. Sample quality QC identifies bad samples that behave unexpected using Exploratory Data Analysis (EDA). For instance, a sample that decorrelates with the other two biological replicates for the same RNAi condition. This EDA based QC is a standard step of RNA-seq analysis and we follow the most common way - using PCA, correlation and distances. This walkthrough provides a convenient template to generate these QC plots for identifying bad samples.

Both standard and custom application of WPS should go through this sample quality check, and remove all bad samples before proceeding to the biological interpretation phase (such as a DE analysis).

Required packages

# install if not in your environment
library(stringr)
library(ggplot2)
library("GGally") 
library(DESeq2)
library(ggrepel)
library(pheatmap)
library(png)

Running QC

The following code uses a wrapper function to generate EDA plots for each condition. Please note that we offer this function just for the convenience to seamlessly analyze WPS data. Please feel free to perform the EDA/sample quality check in your own way!

source('RNAi_QC_functions.R')

# we use met7 as an example
exps = c('met7')
libs = c('lib1','lib2','lib3','lib4','lib5','lib6')
# specify the ID (name in the treatment covariate) for control samples
controlID = 'x.vector' # in WPS, we conveniently call them 'vector'. If you use another name such as "control", change it to the actual name. 
# loop through each plate and then each library 
for (expID in exps){
  for (libID in libs){
    EDA_plots(expID,libID, controlID)
  }
}
## converting counts to integer mode

## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode

Inspecting QC results

A natural thought for bad quality samples are those appearing to be obviously different than its replicates and such difference is huge and cannot be explained by biological variations. In theory, we can train a classifier or use some cutoffs to programatically identify samples with such behavior. However, we also realize that samples can go awry in various ways and it is always good to take a look at the data before going into any biological interpretations. Therefore, we opted to ask user to interpret these QC plots manually as we did to be certain on how the experiment went. Let’s look at an example EDA result using met7-lib4.

The figures are saved in a single output pdf: met7_lib4_initial_EDA.pdf

The EDA results include three plots for each condition versus control:

  • PCA plot: A PCA plot generated by vst transformed read counts for all genes with batch effects from replicates removed. We generally expect three RNAi samples to be separated from the six control samples when there is a significant transcriptional response. When a sample goes bad, it usually appear as a strong outlier in the PCA plot.
  • distance heatmap: A heatmap showing the euclidean distances between samples, using the same vst data as for PCA. Having distance visualized is to ensure outliers in PCA plot reflects a good global difference. Empirically, we found a bad sample usually have distance > 50.
  • pairwise scatter plot: This is a critical plot showing the correlating between replicates. We found that bad samples usually appears to be highly decorrelated, especially in the region of highly expressed genes (so the points are even more scatter out for genes that are highly expressed). Empirically, we found bad samples usually have r^2 < 0.95, i.e., corr < ~0.9746.
    • We are aware that it is often observed that replicates can decorrelate on only a few genes, usually at the bottom left corner. These genes seem usually related to male functions. We consider it normal to observe and do not recommend removing a sample based on this. It is more likely to be biological variations and DE analysis will take care of it.

An example of bad samples is x.him_14_rep2 in met7-lib4. We noticed that it appeared as a strong outlier in PCA and distance heatmap plot, and its correlation with the other two replicates are not as good as typical ones, especially with a decorrelation at highly expressed region. These together suggest excluding x.him_14_rep2 from the downstream analysis as it may have technical problem such as mRNA degradation in the experiment.

Notably, the frequency of bad samples is extremely low with our final WPS protocol. So, one should not expect removing many samples in the QC step. Instead, it is more for gaining confidence in the data and spot any unexpected phenomenon.

Labeling bad samples for exclusion

When inspecting the plots, one should interactively identify and label out the bad samples. These samples will be excluded in our next step of saving final cleaned dataset. For convenience, we also provide a wrapper function to seamlessly label out these samples.

# manually set the bad sample set for each library. Still run this code with empty bad set even if there is no bad samples. This is to be seamless and organized. 
expID = 'met7' 
libID = 'lib4'
badSet = c('x.him_14_rep2') # put all bad samples in the corresponding library here  
# if there is no bad sample, put an empty badSet by badSet = c()
qLevels(expID, libID, badSet)

Now, we are good to do the final step of data cleaning/QC - save the clean data.